Journaling database changes using a bit map for zones defined in each page

ABSTRACT

The disclosure and claims herein are directed to efficient journaling for recovery of a database index by journaling zones of a page. A journal mechanism maintains a page zone bit map that includes a bit for a plurality of zones in each page to indicate which zones have had their unchanged image journaled before being changed since a last sync point update. The page zone bit map has a bit for each zone in each page so that the status of each zone can be tracked separately. Tracking the smaller zones of the pages makes the process more efficient both at run time and during recovery by reducing the period of time for memory deposits and reducing the amount of total redundant/recovery data sent to disk for larger pages.

CROSS-REFERENCE TO PARENT APPLICATION

This patent application is a divisional of U.S. Ser. No. 12/261,097 filed on Oct. 30, 2008, which is incorporated herein by reference.

BACKGROUND

1. Technical Field

This disclosure generally relates to system recovery in a computer database system, and more specifically relates to journaling database changes using a bit map for zones defined in each page of large page indexes in a structured query language index.

2. Background Art

Computer databases typically contain data space entries, or records, plus indexes that provide ordered lists of the data space entries based on key values contained in the data space entries. When changes are made to the entries in a data space(s), the corresponding database indexes over the data space may need to be updated in order to keep the indexes synchronized with respect to the data space they cover. Often the changes to the database index(es) are made first, followed by the changes to the data space. This order of changes is chosen to allow any conditions that would prevent the updating of the database indexes to surface before a data space is changed. The attempt to insert a duplicate key into a unique index is one such condition. When the system terminates abnormally, the data spaces and the database indexes relating thereto may not be synchronized. Some transactions may have caused database index(es) to be updated, but the associated data space entries may not have been updated at the time the system terminated.

Journaling of transactions which cause a change in a database is a well known technique, and is described in detail in the following references: U.S. Pat. No. 4,819,156 to DeLorme et al., and U.S. Pat. No. 5,574,897 to Hermsmeier et al. These prior art approaches were developed when the size of the logical pages being logged as virgin images within indexes was rather modest. Current operating systems (such as i5/OS by International Business Machines Corporation (IBM)) provide customized logical page sizes for indexes that can vary from 4K up to 512K bytes. Larger logical page sizes often improve query performance because they increase the locality of reference, reduce the number of off-page traversals, and reduce the total number of disk-to-memory transfers required in order to satisfy the query operation. However, this query improvement comes at a price, and that price is often paid in run-time index maintenance overhead as well as in increased high availability recovery time. Each time a key is added to or removed from an index, the surrounding structured query language (SQL) index is placed at risk of loss, and this at-risk condition is mitigated by logging/journaling the so-called virgin/before image of the entire logical leaf page (see the patents cited above). The larger the leaves of the index, the more overhead, the larger the main memory footprint, the more churn and the greater the disk traffic associated with such index logging. This presents system administrators with a dilemma: accept increased index maintenance overhead and longer high availability recovery in exchange for better query performance, or forgo that performance.

The prior art algorithms for journaling indexes break down when large logical leaf page sizes are employed and the resulting performance suffers. Disk write traffic soars and gate contention duration rises as increasingly larger quantities of bytes are being managed. A trimmer approach is needed which doesn't flood the disk with so many bytes on behalf of the before/virgin images of SQL indexes.

Without a way to more efficiently journal the affected areas of these larger page sizes by using a smaller footprint, system administrators will continue to be forced to choose between better query performance and fast recovery from failures in a computer database.

BRIEF SUMMARY

The disclosure and claims herein are directed to efficient journaling for recovery of a database by journaling zones of a page. As described herein, a journal mechanism maintains a page zone bit map that includes a bit for a plurality of zones in each page to indicate which zones have had their unchanged image journaled before being changed since a last sync point update. The page zone bit map has a bit for each zone on a page so that the status of each zone can be tracked separately. Tracking the smaller zones of the pages makes the process more efficient both at run time and during recovery by reducing the period of time other operations are held at bay by locks and gates for memory deposits and reducing the amount of total data sent to disk for larger pages.

The foregoing and other features and advantages will be apparent from the following more particular description, as illustrated in the accompanying drawings.

BRIEF DESCRIPTION OF THE SEVERAL VIEWS OF THE DRAWING(S)

The disclosure will be described in conjunction with the appended drawings, where like designations denote like elements, and:

FIG. 1 is a block diagram of an apparatus with a journal mechanism for efficient journaling of a database index;

FIG. 2 is a more detailed block diagram of the journal mechanism in the computer system;

FIG. 3 is a block diagram of a page zone bit map;

FIGS. 4 and 5 illustrate an example of implementing a page zone bit map;

FIG. 6 is a method flow diagram for using a page zone bit map; and

FIG. 7 is another method flow diagram for using a page zone bit map.

DETAILED DESCRIPTION

1.0 Overview

The present invention relates to efficient journaling for recovery of a database by journaling zones of a leaf page for a database index. For those not familiar with the concepts of journaling of database indexes, this Overview section will provide background information that will help to understand the present invention.

Databases may be comprised of data spaces that contain data space entries, or records, and database indexes that provide ordered lists of data space entries, based on key values contained in the data space entries. When changes are made to the entries in a data space(s), database indexes over the data space may need to be updated, in order to keep the indexes synchronized with respect to the data space they cover. In the IBM iSeries, the changes to the database index(es) are made first, followed by the changes to the data space. This order of changes is chosen to allow any conditions that would prevent the updating of the database indexes to surface before a data space is changed. The attempt to insert a duplicate key into a unique index is one such condition.

When the system terminates abnormally, the data spaces and the database indexes relating thereto may not be synchronized. Some transactions may have caused database index(es) to be updated, but the associated data space entries may not have been updated at the time the system terminated. To further complicate matters, in a virtual storage environment with paging, the paging routine may not have written the changed pages for either the data space or the associated database index(es) to nonvolatile storage, or it may have only written some of the changed pages for either the data space or the database index(es) to nonvolatile storage at the time of a failure. If some, but not all, of the changed pages for a database index were written to nonvolatile storage before an abnormal termination, the logical structure of the index that is available from nonvolatile storage after termination may be sufficiently inconsistent so as to preclude use of the index, even as a starting point for forward recovery (using a journal of data space entry changes).

Journaling transactions to a database works well for recovery of the data space, because it is only necessary to journal the image of each data space entry before and after each change. Each data space entry is localized at a fixed position within the data space, so few pages are changed when a data space entry is updated.

Journaling the changes to the database indexes relating to a data space is more complex because, depending on the type of data structure used for the index, a change to a single entry in an index may require changes to many logical pages in the index. Many popular index structures, such as binary radix trees and B-trees, exhibit the characteristic that a change to a single entry can require changes distributed through many logical pages of the index. An approach of journaling all changes to a database index may require so many pages to be journaled for each change of a data space entry that the technique cannot be used because of the very large storage requirements for the journal or because the performance cost of the required journal activity may be prohibitive.

Database indexes typically comprise binary radix tree indexes defined over data spaces. Journaling of unchanged index pages is also beneficial with other implementations of database indexes, such as B-Trees. A write-ahead journal is used to reflect all changes to a data space before the data space entries are actually changed. Changed index pages are not allowed to be written to auxiliary storage until their corresponding unchanged page images have been written to a journal on auxiliary storage. Thus, the journal on auxiliary storage always contains information that corresponds to the most recent changes to the journaled database indexes and data spaces, even before the indexes and data spaces are changed on auxiliary storage.

Unchanged database index pages are copied to a buffer in main storage before they are written to the journal on auxiliary storage. The buffer in main storage is not forced to be written to auxiliary storage until the before/virgin images of all database index pages to be changed and the unchanged and changed images of the data space entries are added to the buffer. Allowing the journal information to accumulate in a main storage buffer reduces the number of I/O operations necessary to write the information to auxiliary storage, which can improve performance. Because this procedure allows the database index pages to be changed in main storage before the unchanged/virgin index pages are written to the journal on auxiliary storage, it is necessary to provide a mechanism to make sure that the write operation(s) for the journal are completed before the write operation(s) for the database index(es) are initiated.

The database indexes, data spaces, and journal reside on pages in a virtual storage environment. When a page from virtual storage is pinned in mainstore, the storage management mechanism of the system is not allowed to write the page to auxiliary storage or to re-assign the mainstore page frame to a different virtual page. The write operations to auxiliary storage are ordered by pinning any pages in a database index from just before the page is changed for the first time (in main storage) until after the unchanged page image is written to the journal on auxiliary storage. Other mechanisms are possible to ensure that the journal is updated before the database index on auxiliary storage, and such mechanisms are considered to be within the scope of the disclosure and claims herein.

A journal sync point is a marker, or pointer, which is associated with a particular journaled database index or data space, and which identifies the oldest entry in the journal that is needed to recover the associated journaled object after an abnormal termination. Each journaled database index and data space has its own sync point. The sync point can be viewed as the position in the journal that corresponds to the last (most recent) time when the state of the journaled database index or data space on auxiliary storage was known to be completely reliable and consistent. The sync point for a journaled object is updated to reference a different journal entry whenever all pending changes for the object (database index or data space) are forced to be written from main storage to auxiliary storage.

The recovery of a journaled database index after an abnormal termination relies on the ability to return the index to some completely consistent state, and then re-process changes to bring the index up to date with respect to the data spaces it covers. Since the journal sync point for a database index identifies a point where the index is in a consistent state, the recovery process needs to restore the state of the index at the time when the associated journal sync point was last updated. In order to return the index to its state at the last sync point, the journal must contain at least the unchanged/virgin images of every database index page that was changed in response to a change in one of the data spaces the index covers. By capturing the so-called “before” image of such index pages, this state can be reconstructed.

Preferably only the images of unchanged database index pages are saved in the journal. Once the image of an unchanged page in a database index has been added to the journal, no additional journal entries are required for that page until after the next sync point update, regardless of how many times an individual page is updated. Thus, if multiple changes occur between sync point updates to the same pages of the database index, there is no need to gather and save the contents of index pages that may contain complex and redundant changes. This initial capture is known as the virgin image of the index page. By capturing only the virgin image (not subsequent images) of the page, substantial space savings ensue. Other techniques are possible, such as saving the image of every database index page before every change. The preferred embodiment reduces the number of auxiliary storage I/O operations and the amount of auxiliary storage required, if multiple changes are made between sync point updates to the same database index page(s). If the images residing with the journal are transported to a second server so as to provide redundancy to help assure protection against outages, the communication traffic is reduced as well.

A mechanism is required to record which index pages remain unchanged since the last sync point, and which pages have had their unchanged images journaled before they were changed. Typically, a bit map is associated with the database index to determine which pages have been journaled and changed since the last sync point update. In the prior art, each bit in the map represents a single logical page in the index, and there is a separate bit map for each journaled database index. All the bits in the map for a journaled index are cleared (set to zero) when the index sync point is updated. The unchanged image of a database index page that has not been changed since the last sync point update is called a “virgin” page image. Before a page in the index is changed, the corresponding bit is tested to determine whether the page is still a virgin page. If the bit is reset (zero), the virgin image of the page is added to the journal, the bit is set (to one), the page is pinned in mainstore, and then the page is changed. If the bit is already set (to one) when a page must be changed, the page is just updated (without journaling or pinning it in mainstore). Other techniques are possible to distinguish between virgin pages and index pages that have been changed since the last sync point.
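The page-level bit test described above can be illustrated with a minimal sketch. It assumes pages are held in a Python list and the journal is an in-memory list; the class and method names (PageLevelJournal, change_page, update_sync_point) are hypothetical and are not taken from the cited patents.

```python
class PageLevelJournal:
    """Illustrative prior-art style tracking: one bit per logical index page."""

    def __init__(self, num_pages):
        self.bits = [0] * num_pages   # 0 = page is still a virgin page
        self.journal = []             # (page_no, virgin_image) deposits
        self.pinned = []              # pages pinned until the journal is forced

    def change_page(self, pages, page_no, new_image):
        if self.bits[page_no] == 0:
            # First change since the last sync point: journal the virgin
            # image, set the bit, and pin the page before changing it.
            self.journal.append((page_no, pages[page_no]))
            self.bits[page_no] = 1
            self.pinned.append(page_no)
        # Bit already set: the page is just updated.
        pages[page_no] = new_image

    def update_sync_point(self):
        # All bits are cleared when the index sync point is updated.
        self.bits = [0] * len(self.bits)


pages = ["a", "b"]
j = PageLevelJournal(len(pages))
j.change_page(pages, 0, "a1")
j.change_page(pages, 0, "a2")   # no second journal deposit for page 0
```

Only the first change to page 0 produces a journal deposit, which is the saving the paragraph describes.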

A list of all the database index pages that are currently pinned is updated to add an entry every time an index page is pinned (before it is updated in main storage). After unchanged and changed images of the associated data space entry are added to the journal and the journal is forced to be written to auxiliary storage, all the pages in the list are unpinned (which allows the pages to be written by the system storage management means to auxiliary storage), and all entries are removed from the list of pinned pages.

The journal sync point for a database index is updated occasionally, in order to limit the number of journal entries that must be used to recover after an abnormal termination. The more journal entries allowed between sync point updates for database indexes, the more journal entries that may need to be read from auxiliary storage and processed after an abnormal termination, and the longer recovery may take. A parameter is provided to allow the database user to control how frequently the sync points for database indexes are updated.

To recover database indexes and data spaces, the appropriate journal entries appearing after sync points for each object are applied to the indexes and data spaces. The sync points for indexes need not be the same as for data spaces. This is beneficial because it allows the system to avoid writing to auxiliary storage, at the same time, all the changed pages for database indexes and the data spaces they cover. The I/O operations required to write multiple objects to auxiliary storage could have severe performance impacts on the rest of the system. All objects in the set of database indexes and the data spaces they cover need not be synchronized (written to auxiliary storage) in unison in order to synchronize any one object.

To recover a data space or index, the entries on the journal (generated by transactions against the database being journaled) since the latest sync point for each object are applied to the appropriate data space or index. The first step is to apply all journaled virgin images to the database index, to return the index to the consistent state that existed for the last sync point. The next step is to apply all journaled changes to the data space(s), and to record index changes that will be required to bring the database index(es) up to date. The final step is to apply the recorded changes to the index, which updates the index from its state at the last sync point to the state that corresponds with the last (newest) entry in the journal.
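The three recovery steps above can be sketched as follows. This is a simplified illustration under stated assumptions: each index page is modeled as a dict of key to record id, each record's value doubles as its key, journal entries are tuples tagged "virgin" or "data", and page_of is a hypothetical function mapping a key to its index page. None of these representations come from the source.

```python
def recover(index_pages, data_space, journal_since_sync, page_of):
    """Illustrative three-step recovery from journal entries after the sync point."""
    # Step 1: apply journaled virgin images to return the index to its
    # consistent state at the last sync point.
    for kind, *fields in journal_since_sync:
        if kind == "virgin":
            page_no, image = fields
            index_pages[page_no] = dict(image)

    # Step 2: apply journaled data space changes and record the index
    # changes they imply.
    index_changes = []
    for kind, *fields in journal_since_sync:
        if kind == "data":
            rec_id, before_key, after_key = fields
            if after_key is None:
                data_space.pop(rec_id, None)
            else:
                data_space[rec_id] = after_key
            index_changes.append((rec_id, before_key, after_key))

    # Step 3: apply the recorded changes to bring the index up to date.
    for rec_id, before_key, after_key in index_changes:
        if before_key is not None:
            index_pages[page_of(before_key)].pop(before_key, None)
        if after_key is not None:
            index_pages[page_of(after_key)][after_key] = rec_id


# State on auxiliary storage after a crash: index page 0 was written with a
# new key, page 1 and the data space changes were not.
index_pages = {0: {"apple": 1, "avocado": 2}, 1: {}}
data_space = {1: "apple"}
journal = [
    ("virgin", 0, {"apple": 1}),
    ("data", 2, None, "avocado"),
    ("virgin", 1, {}),
    ("data", 3, None, "mango"),
]
recover(index_pages, data_space, journal, lambda k: 0 if k < "m" else 1)
```

After the call, the index and the data space agree again, regardless of which changed pages happened to reach auxiliary storage before the failure.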

2.0 Detailed Description

The claims and disclosure herein provide for efficient recovery of a database by journaling zones of a page. A bit map of zones within each page is maintained so that the status of each zone can be tracked. Instead of tracking the changes at the logical page level as done in the prior art, the underlying machine index support in the operating system tracks a zone of the leaf page that could be much smaller than the page size. The same logging mechanisms and recovery mechanisms outlined in the patents cited above can be used except that zones would be managed instead of full page leaves. Tracking the smaller zones of the leaf pages makes the process more efficient both at run time and during recovery by reducing the period of time locks are held for memory deposits and reducing the amount of total data sent to disk for larger pages. By tracking at this more granular level (a sub-page) those applications which tend to have a scattered locality of reference pattern within the SQL index at index maintenance time would move substantially fewer bytes into the underlying journal. That would reduce both the run time burden and the size of the main memory footprint as well as speed up the recovery processing. The journal would flag these virgin image deposits as mere zones and replay those zones during a subsequent abnormal initial program load (IPL) by feeding the zone images back to the OS, which would overlay the matching zone on the disk with the virgin image harvested from the journal. The space and performance benefits would be most substantial for applications and indexes where the reference pattern tends to have little locality of reference (telephone number updates for example).
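To make the byte savings concrete, the following sketch compares the virgin-image bytes journaled under page-level and zone-level tracking for a set of scattered changes. The 4096-byte zone size and the function name journaled_bytes are assumptions chosen for illustration; the disclosure does not fix a zone size.

```python
def journaled_bytes(touched, page_size, zone_size):
    """Return (page-level bytes, zone-level bytes) of virgin images journaled.

    touched: iterable of (page_no, byte_offset) for each change made since
    the last sync point. Each distinct page or zone is journaled only once.
    """
    pages = {page_no for page_no, _ in touched}
    zones = {(page_no, offset // zone_size) for page_no, offset in touched}
    return len(pages) * page_size, len(zones) * zone_size


# Four changes scattered over two 512K pages, tracked in assumed 4K zones.
touched = [(0, 100), (0, 5000), (1, 200), (1, 300)]
page_bytes, zone_bytes = journaled_bytes(touched, 512 * 1024, 4096)
# page_bytes == 1048576 (two whole pages); zone_bytes == 12288 (three zones)
```

In this example the zone-level approach deposits three 4K zones instead of two whole 512K pages.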

Referring to FIG. 1, a computer system 100 is one suitable implementation of a computer system that includes a journal mechanism to facilitate efficient processing and recovery of a database structure such as a database index. Computer system 100 is an IBM iSeries computer system. However, those skilled in the art will appreciate that the disclosure herein applies equally to any computer system, regardless of whether the computer system is a complicated multi-user computing apparatus, a single user workstation, or an embedded control system. As shown in FIG. 1, computer system 100 comprises one or more processors 110, a main memory 120, a mass storage interface 130, a display interface 140, and a network interface 150. These system components are interconnected through the use of a system bus 160. Mass storage interface 130 is used to connect mass storage devices, such as direct access storage devices 155, to computer system 100. One specific type of direct access storage device 155 a is a readable and writable CD-RW drive, which may store data to and read data from a CD-RW 195. Another type of direct access storage device 155 b is a readable media such as a disk drive which stores a journal 156, data pages 157 and index pages 158 as described further below.

Main memory 120 preferably contains an operating system 121. Operating system 121 is a multitasking operating system known in the industry as i5/OS; however, those skilled in the art will appreciate that the spirit and scope of this disclosure is not limited to any one operating system. The memory includes a paging mechanism 122. The memory further includes a journal mechanism 123 that contains a page zone bit map 124, a pinned page list 125, a journal buffer 126 and a key mapping 127. The memory further contains data space pages 128 and index pages 129. Each of these entities in memory is described further below.

Computer system 100 utilizes well known virtual addressing mechanisms that allow the programs of computer system 100 to behave as if they only have access to a large, single storage entity instead of access to multiple, smaller storage entities such as main memory 120 and DASD device 155. Therefore, while operating system 121, paging mechanism 122, journal mechanism 123, page zone bit map 124, pinned page list 125, journal buffer 126, key mapping 127, data space pages 128 and index pages 129 are shown to reside in main memory 120, those skilled in the art will recognize that these items are not necessarily all completely contained in main memory 120 at the same time. It should also be noted that the term “memory” is used herein generically to refer to the entire virtual memory of computer system 100, and may include the virtual memory of other computer systems coupled to computer system 100.

Processor 110 may be constructed from one or more microprocessors and/or integrated circuits. Processor 110 executes program instructions stored in main memory 120. Main memory 120 stores programs and data that processor 110 may access. When computer system 100 starts up, processor 110 initially executes the program instructions that make up operating system 121.

Although computer system 100 is shown to contain only a single processor and a single system bus, those skilled in the art will appreciate that the journal mechanism may be practiced using a computer system that has multiple processors and/or multiple buses. In addition, the interfaces that are used preferably each include separate, fully programmed microprocessors that are used to off-load compute-intensive processing from processor 110. However, those skilled in the art will appreciate that these functions may be performed using I/O adapters as well.

Display interface 140 is used to directly connect one or more displays 165 to computer system 100. These displays 165, which may be non-intelligent (i.e., dumb) terminals or fully programmable workstations, are used to provide system administrators and users the ability to communicate with computer system 100. Note, however, that while display interface 140 is provided to support communication with one or more displays 165, computer system 100 does not necessarily require a display 165, because all needed interaction with users and other processes may occur via network interface 150.

Network interface 150 is used to connect computer system 100 to other computer systems or workstations 175 via network 170. Network interface 150 broadly represents any suitable way to interconnect electronic devices, regardless of whether the network 170 comprises present-day analog and/or digital techniques or via some networking mechanism of the future. In addition, many different network protocols can be used to implement a network. These protocols are specialized computer programs that allow computers to communicate across a network. TCP/IP (Transmission Control Protocol/Internet Protocol) is an example of a suitable network protocol.

At this point, it is important to note that while the description above is in the context of a fully functional computer system, those skilled in the art will appreciate that the journal mechanism described herein may be distributed as an article of manufacture in a variety of forms, and the claims extend to all suitable types of computer-readable media used to actually carry out the distribution, including recordable media such as floppy disks and CD-RW (e.g., 195 of FIG. 1).

Embodiments herein may also be delivered as part of a service engagement with a client corporation, nonprofit organization, government entity, internal organizational structure, or the like. These embodiments may include configuring a computer system to perform some or all of the methods described herein, and deploying software, hardware, and web services that implement some or all of the methods described herein.

Again referring to FIG. 1, the paging mechanism 122 in main memory 120 is a mechanism for storing pages of data in main memory. The pages of data for example may be 4K byte pages of data which are paged in and out of volatile main memory 120 by the processor 110 which implements well known paging routines. The pages are stored on nonvolatile auxiliary storage units which are usually disk drive devices such as DASD 155 b. The data residing in main memory 120 and on disk drives 155 b comprises a plurality of databases, consisting of a combination of data space pages 128, 157, index pages 129, 158, journal buffer 126 and journal 156. The indexes provide different views of the data spaces. Data spaces and indexes are also both referred to as objects.

Referring now to FIG. 2, one specific implementation for the database index journal mechanism 123 in FIG. 1 is shown as 200. The journaling mechanism 200 comprises a portion of main memory 120 shown in FIG. 1 for journaling a recovery copy of pages of data. While the journal buffer 126, data space pages 128 and index pages 129 are shown pictorially in one block, portions of these blocks physically reside in multiple pages of main memory 120 and the remaining portions of these blocks are stored in non-volatile memory on DASD 155 b. Unless otherwise noted, references to these blocks using the designations in FIG. 2 include the pages in both locations.

The index pages 129 contain keys relating to data on data space pages 128. The keys are typically organized in a binary radix tree. Further information on keys and binary radix trees is found in Howard and Borgendale, “System/38 Machine Indexing Support”, IBM System/38 Technical Developments, 1978 (IBM Form G580-0237). Mapping between the data space pages 128 and the index pages 129 is provided by a key mapping block 127 which contains information necessary to transform data from a record in the data space 128 into a corresponding key in the index stored in index pages 129.

Copies of changes to be made to the data space pages 128 are buffered in a journal buffer 126. The journaled changes in the journal buffer 126 are written out to auxiliary storage (DASD 155 b) prior to the changes being made on the data space pages. This is commonly known as a write-ahead journal.

Whenever a journaled data space is forced (forced to be written to auxiliary storage in its entirety), a sync point is marked on the journal for that data space. A sync point is a marker representing a point in time at which all previously altered pages of the journaled object have been written from volatile main storage to non-volatile auxiliary storage.

Each time a new sync point is established the recovery processing mechanism can limit processing time by ignoring previous journaled deposits on behalf of the synchronized object. Consequently, this mechanism ensures that recent changes to the data space pages can be recovered in the event of system termination by merely employing the journaled images recorded subsequent to the sync point.

In addition to journaling the changes to the data space pages, a copy of index page zones to be changed is journaled prior to changing the index pages. Pages to be changed are identified as follows. Every index operation that changes an index (either an insert or remove) provides a key to be inserted or deleted. This key is used to search the index to find the point of change in the index. Thus after an initial search of the index, the page(s) which change in response to a data space change are located. Journaling the changes is accomplished by sending the page zone image to the journal if it is a virgin image.

The fact that an index page zone has been journaled is indicated in a page zone bit map 124 which contains a separate distinct bit position for each zone of an index page. If more changes are to be made to the zone of the index page before a sync point occurs for the index page, the corresponding bit position in the bit map 124 is examined. If the bit is on, the index changes are made without journaling the zone of the index page again. (See FIG. 3 and the corresponding description below.)

Changed index pages not already journaled since the last sync point are pinned and tracked in a pinned page list 125. The page is pinned before the virgin index page zone is sent to journal buffer 126. The presence of this pin prevents this page from being written out by normal virtual memory paging functions. After the page zone is sent to the journal buffer 126, the changes to the index pages are made.
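The zone-level sequence described in the preceding paragraphs (examine the zone's bit, pin the page, send the virgin zone image to the journal buffer, then change the page) can be sketched as below. This is a minimal illustration, assuming pages are bytearrays of equal size and zones are fixed-size byte ranges; the ZoneJournal class and its attribute names are hypothetical, and the bit map is stored one byte per bit for readability.

```python
class ZoneJournal:
    """Illustrative zone-level virgin image journaling for index pages."""

    def __init__(self, num_pages, page_size, zone_size):
        self.zone_size = zone_size
        self.zones_per_page = page_size // zone_size
        # Page zone bit map: one entry per zone of each page.
        self.bits = bytearray(num_pages * self.zones_per_page)
        self.journal_buffer = []   # (page_no, zone_no, virgin_zone_image)
        self.pinned_pages = []     # pinned page list

    def write(self, pages, page_no, offset, data):
        first = offset // self.zone_size
        last = (offset + len(data) - 1) // self.zone_size
        for zone in range(first, last + 1):
            bit = page_no * self.zones_per_page + zone
            if not self.bits[bit]:
                # Pin the page before the virgin zone goes to the buffer.
                if page_no not in self.pinned_pages:
                    self.pinned_pages.append(page_no)
                start = zone * self.zone_size
                image = bytes(pages[page_no][start:start + self.zone_size])
                self.journal_buffer.append((page_no, zone, image))
                self.bits[bit] = 1
        # Only after the zone images are buffered is the page changed.
        pages[page_no][offset:offset + len(data)] = data


pages = [bytearray(16), bytearray(16)]
zj = ZoneJournal(num_pages=2, page_size=16, zone_size=4)
zj.write(pages, 0, 5, b"AB")   # first change to page 0, zone 1: journaled
zj.write(pages, 0, 6, b"C")    # same zone again: bit is on, no new deposit
zj.write(pages, 1, 3, b"XY")   # spans zones 0 and 1 of page 1
```

The second write to the same zone adds nothing to the journal buffer, while a write that straddles a zone boundary deposits both affected zones.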

The changes to the data spaces are reflected on the journal buffer 126 and are then written synchronously via a storage management function to auxiliary storage. In the illustrated example herein, the virgin index page zones are also written at the same time. They piggyback out to auxiliary storage with the changes to the data spaces. Thus, both varieties of journal deposits are bundled into a single packet of bytes, hence there is no extra I/O operation required to journal the index other than that required to journal the data space alone.

The pins on the now changed virgin index pages are pulled (the pinned page list 125 is used to identify these pages), via a request to storage management. This allows the altered index page images to again participate in normal paging activity. The pages are also removed from the pinned page list 125. The changes to the data spaces are also made following the synchronous write of the journal buffer. The above order ensures that any time the system crashes with loss of main storage content, the data spaces and indexes can be reconstructed purely from images resident on the journal.

Periodically objects being journaled are synchronized. A selection mechanism forces the object with the oldest (earliest) sync point to auxiliary storage every n journal entries, where n is a value selected to strike a balance between recovery time and performance overhead accompanying the sync point mechanism. The value n is referred to as a recovery constant.

Synchronization of the oldest object serves to limit the length of the recovery time by ensuring that during recovery (after a machine failure) the journal need not be processed further back than the final n entries residing on the journal. The recovery constant ensures that no object has a sync point more than n entries from the end of the journal.
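The selection rule described above can be sketched as follows. The sketch illustrates only the rule of forcing the object with the oldest sync point every n journal entries; how entries and objects are counted, and therefore the exact bound that results, is not spelled out in the text. The class name SyncPointManager and the force callback are hypothetical.

```python
class SyncPointManager:
    """Illustrative selection of the object to synchronize every n entries."""

    def __init__(self, object_names, n):
        self.n = n                    # recovery constant
        self.journal_len = 0          # journal entries written so far
        self.sync_points = {name: 0 for name in object_names}

    def append_entry(self, force):
        """Count one journal entry; every n entries force the oldest object."""
        self.journal_len += 1
        if self.journal_len % self.n != 0:
            return None
        # Object whose sync point is furthest back in the journal.
        oldest = min(self.sync_points, key=self.sync_points.get)
        force(oldest)                 # write its pending changes to storage
        self.sync_points[oldest] = self.journal_len
        return oldest


forced = []
mgr = SyncPointManager(["index", "data"], n=3)
for _ in range(6):
    mgr.append_entry(forced.append)
# forced == ["index", "data"]; sync points are now index=3, data=6
```

A larger n means fewer forced writes at run time but more journal entries to process during recovery, which is the trade-off the recovery constant is meant to balance.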

FIG. 3 shows additional detail of a page zone bit map 124. The page zone bit map 124 has a bit for each zone of each page of the index pages 129. As shown in FIG. 3, bit 1 (at 310) corresponds to zone 1 page 1 (at 320), and bit 2 (at 312) corresponds to zone 2 page 1 (at 322). There may be other zones in page 1 that are not shown. Similarly, bit 3 (at 314) corresponds to zone 1 page 2 (at 324), and bit 4 (at 316) corresponds to zone 2 page 2 (at 326). There may be any number of pages with any number of zones such that bit n (at 318) corresponds to zone n, page n (at 328). The zone size compared to the index page size could be chosen to optimize different aspects of the performance, resulting in different possible numbers of zones per index page size. Again, the remaining bits are not shown for simplicity.
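One way to picture the bit layout of FIG. 3 is a flat bit map indexed by page and zone. The sketch below is an assumption-laden illustration: it assumes every page has the same number of zones (two, matching only the zones shown in the figure), uses 1-based page, zone, and bit numbers as the figure does, and stores the bits in a Python integer. The class `PageZoneBitMap` and its methods are hypothetical names.

```python
class PageZoneBitMap:
    """Hypothetical flat bit map with one bit per zone of each page."""

    def __init__(self, zones_per_page):
        self.zones_per_page = zones_per_page
        self.bits = 0  # bit k of this integer holds bit number k + 1

    def bit_number(self, page, zone):
        # 1-based numbering: page 1 zone 1 is bit 1, page 2 zone 1 follows
        # the last zone of page 1.
        return (page - 1) * self.zones_per_page + zone

    def is_set(self, page, zone):
        return bool((self.bits >> (self.bit_number(page, zone) - 1)) & 1)

    def set(self, page, zone):
        self.bits |= 1 << (self.bit_number(page, zone) - 1)

    def clear(self, page, zone):
        self.bits &= ~(1 << (self.bit_number(page, zone) - 1))


# Example usage with two zones per page, as in the portion of FIG. 3 shown.
bit_map = PageZoneBitMap(zones_per_page=2)
bit_map.set(page=2, zone=1)
```

With this layout the status of each zone is tracked independently, so setting the bit for zone 1 of page 2 leaves every other zone's bit untouched.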

Since what had formerly been treated as a single logical page (at least 4 k and often as big as 512 k apiece) is now going to be viewed as broken into zones, there are going to be times when a new key comes along and finds that lots of the zones within the surrounding logical page are completely empty. In this case, if the new keys land in such an empty zone, it makes sense from an efficiency point of view to avoid capturing a virgin image of that empty zone. Doing so further helps achieve the overall objective of minimizing the quantity of bytes which are moved into the journal and ultimately written to disk. Blindly capturing the “before”/virgin image of each zone without regard to the zone's status would wastefully store these empty zones. This would be wasteful by slowing down both run time and IPL/recovery time as well as bloating the journal along the way. Thus, those zones whose virgin states have not yet been journaled/captured are preferably journaled the first time they are modified, because the “virgin” image of the zone is available. In our examples below, the journal mechanism would wait to journal those zones with virgin states until the first time they are modified. The fact that the zone bit isn't yet turned on signifies that the virgin state of this particular zone has not yet been captured. Periodically, set zones get “aged” and reset by the sync-point process, where after “n” journal entries have arrived, the oldest entries are removed when the main memory resident “after” images of the zone are written to disk. Thus the matching bit for the zone may be zeroed or cleared for zones written to disk while establishing the new sync-point.
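The decision of whether a before image needs to be captured, and the clearing of bits at a sync point, might look roughly like the following. This is a sketch under assumptions: a zone is modeled as a list of keys, the zone bits as a dictionary of booleans, and both function names are hypothetical. Returning no image for an empty zone is one reading of the efficiency point made above, not a confirmed detail of the implementation.

```python
def before_image_to_journal(zone_keys, zone_bit_set):
    """Return the before image to deposit, or None if nothing needs journaling."""
    if zone_bit_set:
        # The virgin image was already captured since the last sync point.
        return None
    if not zone_keys:
        # Empty zone: there is no populated portion worth capturing.
        return None
    return list(zone_keys)  # copy of the populated portion


def reset_bits_at_sync_point(zone_bits, zones_written_to_disk):
    """Clear the bit of every zone whose after image has reached disk."""
    for zone in zones_written_to_disk:
        zone_bits[zone] = False


# Example usage with made-up zone names.
zone_bits = {"Z2": True, "Z3": True, "Z4": False}
first_touch = before_image_to_journal(["E"], zone_bit_set=False)
second_touch = before_image_to_journal(["E", "F"], zone_bit_set=True)
empty_zone = before_image_to_journal([], zone_bit_set=False)
reset_bits_at_sync_point(zone_bits, ["Z3"])
```

After the reset, the next change to zone Z3 would again find its bit off and capture a fresh before image.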

FIGS. 4 and 5 illustrate an example of journaling a logical page with a plurality of zones as described herein. In this example the page zone bit map comprises a page bit map that has a bit for each page, and a zone bit map. Alternatively, there could be a single page zone bit map as described above in FIG. 3. FIG. 4 represents portions of an index 400. Index 400 includes a logical page bit map 410 with a bit for each page (412, 414, 416, 418), a logical zone bit map 420 housing a bit for each zone (422, 424, 426, 428), and four logical pages (P1, P2, P3, P4) each 16 k in size (only page P2 430 is shown). Index 400 further includes a journal buffer 450 for holding journal entries that indicate populated portions of logical pages before the image of the page is updated. Logical page P1 is the trunk page and pages P2, P3, P4 are leaf pages. Each of the logical pages is subdivided into zones with each zone 4 k in size. Hence page P2 430 consists of zones Z1 432, Z2 434, Z3 436, and Z4 438. In our example, pre-existing key values C, D, and E already reside in logical page P2 of the index. Key values C and D reside in Zone Z2 434 while key value E resides in Zone Z3 436. For this example, a new row is to be added to the data space such that the new row houses key value “F”. The new key value “F”, when added to the index, will reside on Page P2 430 in Zone Z3 436.

FIG. 5 illustrates the index 400 in FIG. 4 after adding the new key value “F” 440. The new key value is added to the logical page as follows. First, the index is navigated until the leaf level page destined to house the new key is identified. In this case the leaf page P2 430 is identified. The corresponding bit for logical page P2 414 is then set as shown to indicate page P2 has one or more zones that have bits set. Next, the logical page is pinned in a pinned page list to indicate that the new image of the page should not be allowed to reach the disk until the matching journal entries that are about to be produced have reached the disk. The zone(s) within the page which will be modified as a result of storing the new key (“F”) within this logical page are identified. In our example, zone Z3 436 of logical page P2 430 is identified. If the matching zone bit(s) are not already set, then the corresponding zone bits are set to indicate which zones have had their unchanged image journaled before being changed since a last sync point update. In our example zone bit Z3 426 is turned on or set. Next, a copy of the populated portion of the affected zone is extracted. In this example, the key value “E” residing within Z3 436 is extracted. Then a matching new journal entry “E” 455 representing the virgin before image of the populated portion of the affected zone is constructed with the extracted copy and placed in the journal buffer 450 as shown. The new key value “F” is then copied into the affected zone(s), which in this example is zone Z3 436.

FIG. 6 shows a method 600 for journaling zones of a leaf page for a database index. The steps in method 600 are preferably performed by the journal mechanism 123 in combination with the paging mechanism 122 shown in FIG. 1. First, provide a database journal (step 610). Next, provide a page zone bit map with a bit for each zone within each page of the database index (step 620). Then, journal the database index using a bit in the page zone bit map to indicate which zones have had their unchanged image journaled to the database journal (step 630). The method 600 is then done.

FIG. 7 shows another method 700 for journaling zones of a leaf page for a database index. The steps in method 700 are preferably performed by the journal mechanism 123 in combination with the paging mechanism 122 shown in FIG. 1. First, navigate the index until the leaf level page destined to house the new key to place in the index has been identified (step 710). Then, turn on the corresponding bit for the logical page identified (step 720). Pin the logical page (step 730). Next, identify the zone(s) within the page which will be modified as a result of storing the new key within this logical page (step 740). If the matching zone bit(s) is not already on (step 750=no), then turn on the corresponding zone bit (step 760), extract a copy of the populated portion of the affected zone (step 770) and construct a matching new journal entry representing the virgin before image of the populated portion of the affected zone (step 780). Then move the new key value (“F” in our example) into the affected zone(s) (step 790). If the matching zone bit(s) is already on (step 750=yes), then go to step 790. The method 700 is then done.
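The steps of method 700 can be traced against the example of FIGS. 4 and 5 with a short sketch. It is illustrative only: the class `IndexSketch` and its fields are hypothetical, zones are modeled as lists of keys rather than 4 k byte ranges, and navigating the index and identifying the affected zone (steps 710 and 740) are assumed to have been done by the caller.

```python
class IndexSketch:
    """Hypothetical model of an index with a page bit map and a zone bit map."""

    def __init__(self, pages):
        # pages maps a page name to a mapping of zone name -> list of keys
        self.pages = pages
        self.page_bits = {p: False for p in pages}
        self.zone_bits = {(p, z): False for p, zones in pages.items() for z in zones}
        self.pinned = []
        self.journal_buffer = []

    def insert_key(self, page, zone, key):
        # Steps 710 and 740 are assumed done: the caller passes the leaf
        # page and the zone that will house the new key.
        self.page_bits[page] = True                      # step 720
        if page not in self.pinned:                      # step 730
            self.pinned.append(page)
        if not self.zone_bits[(page, zone)]:             # step 750 = no
            self.zone_bits[(page, zone)] = True          # step 760
            before_image = list(self.pages[page][zone])  # step 770
            self.journal_buffer.append((page, zone, before_image))  # step 780
        self.pages[page][zone].append(key)               # step 790


# Example usage mirroring FIGS. 4 and 5: keys C and D in Z2, key E in Z3.
index = IndexSketch({"P2": {"Z1": [], "Z2": ["C", "D"], "Z3": ["E"], "Z4": []}})
index.insert_key("P2", "Z3", "F")
entries_after_first_insert = list(index.journal_buffer)
index.insert_key("P2", "Z3", "G")  # "G" is a made-up second key
```

The second insert into zone Z3 finds the zone bit already on, so no further before image is journaled; only the first change since the last sync point pays the journaling cost.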

One skilled in the art will appreciate that many variations are possible within the scope of the claims. Thus, while the disclosure is particularly shown and described above, it will be understood by those skilled in the art that these and other changes in form and details may be made therein without departing from the spirit and scope of the claims.

1. An apparatus comprising: at least one processor; a memory coupled to the at least one processor; a database residing in the memory, the database comprising a plurality of pages that are each divided into a plurality of zones; a paging mechanism for storing the pages from memory to an auxiliary storage; a page zone bit map residing in the memory including a plurality of bits corresponding to one of the plurality of pages, each bit in the page zone bit map corresponding to one of the plurality of zones in the corresponding one page, wherein a state of a bit in the page zone bit map indicates whether any changes have been made to the one corresponding zone; and a journaling mechanism that tracks changes to the plurality of zones in the plurality of pages by changing the state of the plurality of bits to journal a changed zone in a page.
2. The apparatus of claim 1 wherein the page zone bit map comprises a page bit map with a bit for each page and a zone bit map with a bit for each zone in each page.
3. The apparatus of claim 2 wherein the journal mechanism uses the bit corresponding to the page bit map to indicate that a bit in the zone bit map is set, and the journaling mechanism uses the bit in the zone bit map to indicate which zones have had their unchanged image journaled before being changed since a last sync point update.
4. The apparatus of claim 1 wherein the journaling mechanism uses the bit in the page zone bit map to indicate which zones have had their unchanged image journaled before being changed since a last sync point update.
5. The apparatus of claim 1 wherein the journaling mechanism waits to journal zones with virgin states until the first time the zones are modified.
6. A computer-implemented method for journaling zones of a leaf page for a database index, the method comprising the steps of: (A) providing a database with a plurality of pages divided into a plurality of zones; (B) providing a page zone bit map including a plurality of bits corresponding to one of the plurality of pages, each bit in the page zone bit map corresponding to one of the plurality of zones in the corresponding one page, wherein a state of a bit in the page zone bit map indicates whether any changes have been made to the one corresponding zone; and (C) journaling the database using a bit in the page zone bit map to indicate which zones have had their unchanged image stored to the database journal.
7. The method of claim 6 wherein the page zone bit map further comprises a page bit map with a bit for each page and a zone bit map with a bit for each zone in each page.
8. The method of claim 7 further comprising the steps of: using the bit corresponding to the page bit map to indicate that a bit in the zone bit map is set, and using the bit in the zone bit map to indicate which zones have had their unchanged image journaled before being changed since a last sync point update.
9. The method of claim 7 further comprising the step of waiting to journal zones with virgin states until the first time the zones are modified.